Corpus tools for lexicographers

Authors

  • Adam Kilgarriff
  • Iztok Kosem
Abstract

To analyse corpus data, lexicographers need software that allows them to search, manipulate and save data: a 'corpus tool'. A good corpus tool is the key to a comprehensive lexicographic analysis; a corpus without a good tool to access it is of little use. Both corpus compilation and corpus tools have been swept along by general technological advances over the last three decades. Compiling and storing corpora has become far faster and easier, so corpora tend to be much larger than their predecessors. Most of the first COBUILD dictionary was produced from a corpus of eight million words. Several of the leading English dictionaries of the 1990s were produced using the British National Corpus (BNC) of 100 million words. Current lexicographic projects we are involved in use corpora of around a billion words, though this is still less than one hundredth of one percent of the English language text available on the Web (see Rundell, this volume). The amount of data to analyse has thus increased significantly, and corpus tools have had to improve to help lexicographers adapt to this change. Corpus tools have become faster, more multifunctional, and more customizable. In the COBUILD project, getting concordance output took a long time, and the concordances were then printed on paper and handed out to lexicographers (Clear 1987). Today, with Google as a point of comparison, concordancing needs to be instantaneous, with the analysis taking place on the computer screen. Moreover, larger corpora offer far more concordance lines per word (especially for high-frequency words), and, given the time constraints lexicographers work under (see Rundell, this volume), new data summarization features are needed to ease and speed up the analysis. In this chapter, we review the functionality of corpus tools used by lexicographers. In Section 3.2, we discuss the procedures in corpus preparation that are required for some of these features to work. In Section 3.3, we briefly describe some leading tools
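
As a rough illustration of what concordancing involves, the Python sketch below builds a minimal keyword-in-context (KWIC) view of a text: for each hit of a search word it shows a few words of left and right context, much as a corpus tool's concordance screen does. The tokeniser, window size, and sample sentence are simplifying assumptions made for this example, not features of any particular tool.

    # A minimal keyword-in-context (KWIC) concordancer, for illustration only.
    # The tokeniser, window size, and sample sentence are assumptions made for
    # this sketch, not features of any particular corpus tool.
    import re
    from typing import Iterator, Tuple

    def kwic(text: str, keyword: str, window: int = 5) -> Iterator[Tuple[str, str, str]]:
        """Yield (left context, keyword, right context) for each occurrence."""
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())
        key = keyword.lower()
        for i, token in enumerate(tokens):
            if token == key:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                yield left, token, right

    if __name__ == "__main__":
        sample = ("Corpus tools help lexicographers search a corpus, "
                  "sort concordance lines, and summarise corpus data.")
        for left, kw, right in kwic(sample, "corpus"):
            print(f"{left:>40} | {kw} | {right}")

A real corpus tool does the same job over millions or billions of tokens, so it relies on indexing rather than a linear scan, but the KWIC display itself is essentially what this sketch produces.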


Similar articles

Providing Lexicographers with Corpus Evidence for Fine-grained Syntactic Descriptions: Adjectives Taking Subject and Complement Clauses

This article deals with techniques for lexical acquisition which allow lexicographers to extract evidence for fine-grained syntactic descriptions of words from corpora. The extraction tools are applied to partially parsed text corpora and aim to provide the lexicographer with easy-to-use, syntactically pre-classified evidence. As an example, we extracted German adjectives taking subject and comp...


Tools for Upgrading Printed Dictionaries by Means of Corpus-based Lexical Acquisition

We present the architecture and tools developed in the TFB-32 project for updating existing dictionaries by comparing their content with corpus data. We focus on an interactive graphical user interface for the manual selection of the results of this comparison. The tools have been developed and used in cooperation with lexicographers from two German publishing houses.


Detection of Domain Specific Terminology Using Corpora Comparison

Identifying terms in specialized corpora is a central task in terminological work (the compilation of domain-specific dictionaries), but it is labour-intensive, especially when the corpora are voluminous, which is often the case nowadays. For the past decade, terminologists and specialized lexicographers have been able to rely on term-extraction tools to assist them in the selection of terms. However, ...
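
As a rough sketch of what corpora comparison can look like in its simplest form, the example below ranks words by how much more frequent they are in a small domain corpus than in a reference corpus; words with high ratios are candidate terms. The smoothing constant, ranking formula, and toy corpora are assumptions made for illustration, not the method used by any specific term-extraction tool.

    # A toy sketch of term detection by corpora comparison: words are ranked by
    # how much more frequent they are in a domain corpus than in a reference
    # corpus. The smoothing constant and the two tiny corpora are assumptions.
    from collections import Counter
    from typing import Iterable, List, Tuple

    def keyness(domain: Iterable[str], reference: Iterable[str],
                smoothing: float = 1.0) -> List[Tuple[str, float]]:
        """Rank words by normalised frequency ratio (domain vs. reference)."""
        dom, ref = Counter(domain), Counter(reference)
        dom_total, ref_total = sum(dom.values()), sum(ref.values())
        scores = {}
        for word, count in dom.items():
            dom_rate = (count + smoothing) / (dom_total + smoothing)
            ref_rate = (ref.get(word, 0) + smoothing) / (ref_total + smoothing)
            scores[word] = dom_rate / ref_rate
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    if __name__ == "__main__":
        domain = "the corpus stores token frequency data for every corpus query".split()
        reference = "the cat sat on the mat and the dog sat on the rug".split()
        for word, score in keyness(domain, reference)[:5]:
            print(f"{word:15s} {score:.2f}")

Production term-extraction tools use more robust statistics (e.g. log-likelihood) and linguistic filters, but the underlying idea of comparing frequencies across two corpora is the same.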


WASP-Bench: a Lexicographic Tool Supporting Word Sense Disambiguation

We present WASP-Bench: a novel approach to Word Sense Disambiguation, also providing a semi-automatic environment for a lexicographer to compose dictionary entries based on corpus evidence. For WSD, involving lexicographers tackles the twin obstacles to high accuracy: paucity of training data and insufficiently explicit dictionaries. For lexicographers, the computational environment fills the n...


The Berkeley FrameNet Project

FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic ...




Publication date: 2011